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Abstract. In this note we briefly describe our Cholesky modification 
algorithm for streaming multiprocessor architectures. Our implementa- 
tion is available in C++ with Matlab binding, using CUD A to utilise 
the graphics processing unit (GPU). Limited speed ups are possible due 
to the bandwidth bound nature of the problem. Furthermore, a com- 
plex dependency pattern must be obeyed, requiring multiple kernels to 
be launched. Nonetheless, this makes for an interesting problem, and 
our approach can reduce the computation time by a factor of around 
7 for matrices of size 5000 x 5000 and k — 16, in comparison with the 
LINPACK suite running on a CPU of comparable vintage. Much larger 
problems can be handled however due to the 0(n) scaling in required 
GPU memory of our method. 



1 Introduction and Problem Setting 

Given a symmetric positive definite matrix A £ R™ xn , for reasons of compu- 
tational efficiency and stability, it is often indispensable that we are able to 
maintain the upper triangular Cholesky factor L such that A = L T L — see [T] 
for a discussion. Frequently, A will changes by low rank modification during the 
course of an algorithm, hence it is imperative that we can accordingly modify the 
associated Cholesky factor L in an efficient and stable manner, in order to main- 
tain an optimal asymptotic time complexity. We focus on the following problem: 
given A, L, and a matrix V £ M. nxk , form the modified factor L such that 
L T L = A ± VV T = A. In particular we do so with 0(kn 2 ) operations, rather 
than by naively computing the modified matrix A and from there rebuilding the 
full Cholesky factor. This is referred to as the rank k Cholesky up (down) date 
when dealing with addition (subtraction) of VV T . Existing CPU implementa- 
tions such as dchud of LINPACK [3] typically treat the case k = 1. We allow 
k > 1 since this leads to more efficient memory access, although speedups are 
also obtainable with our algorithm for k = 1 (naturally this demands a larger 
problem size n, however). 



Algorithm 1 Choleskymodif yX 

Modify the Cholesky factor L £ R nxn by V G R", with <j G ±1 being positive 
(negative) to specify an update (downdate). 
// first alternative: 
function CholeskyModif yk(L, V, cr) 
for i = 1 to n do 

Compute(ci, L iti , Vi,o) 
for j = 1 to i — 1 do 

end for 
end for 

// second alternative: 
function CholeskyModif yB(L, V, cr) 
for i = 1 to n do 

Compute(ci, Si, Li,i, Vi,a) 
for j = i + 1 to n do 

Applj(ci,Si,Li :J ,Vj,a) 
end for 
end for 
// helper function: 
function Compute(ci, Sj, I/i,i, Vi, cr) 

Cj <— w/Li t t 

Si ^ Vi / Z/i^i 
ii,i U) 

// helper function: 
function Apply (cj, Si, Li,j, Vj, cr) 
Lij- <— (Lij +crSiV^)/ci 



2 Serial Algorithm 

The serial algorithm which we will adapt to the GPU is the so called hyperbolic 
approach which we state as Algorithm [1] for the case k = 1. 

3 Parallel Version 

Let us take stock of the memory accesses in the inner loop of the two possible 
orderings in Algorithm [1] 

— In CholeskyModif yA we read Cj,Sj and and write Ljj. The V, must 
only be read and written before and after the inner loop. 

— In CholeskyModif yB we read and write Lij and Vj. In this case, it is Cj and 
Si which need only be read and written before and after the inner loop. 



Hence we see that if reading and writing were equally costly, then the or- 
dering of CholeskyModif yA would be slightly better — not only that, but 
CholeskyModif yA also offers a rather natural mapping to the GPU shared and 
register memory of current hardware, as we shall see in the following subsection, 
so this is the approach we will employ. We make no claim as to the optimality 
of this approach — if the reader is aware of a superior approach, we would be 
interested to hear about it. 

4 Panelling 

A GPU kernel function which computes the inner loop of CholeskyModif yA 
would need to be launched n times. To avoid the overhead inherent in these 
repeated kernel launches we proceed by computing larger submatrices of L, se- 
quentially and either on the CPU or the GPU. The panelling strategy, which is 
illustrated in Figure [T] will be described in this section. 

4.1 Parameters 

The algorithm has the following parameters: 

1. BlocksPerKernel, which is 3 in figure Q] and 28 in our implementation. 

2. ThreadsPerBlock, which is 32 in our implementation and unspecified in the 
figure (but must equal n/(3 x BlocksPerKernel) as the figure depicts three 
diagonal chunks). 

3. Element sPerThread, which is 16 in our implementation and unspecified in 
the figure. This parameter corresponds to the number of columns of V which 
we process in each kernel call. That is, fc/ElementsPerThread successive 
kernel calls will be employed to process the entire update matrix V in batches 
of size ElementsPerThread. 

We now describe the roles of the CPU and GPU in their respective phases of the 
computation. We do not provide a detailed description but rather a high level 
overview which could serve as an aid in deciphering our C++ implementation. 

4.2 Panel Ordering 

The panels are processed in the order top-left grey: CPU; blue: GPU; middle 
gray: CPU; green: GPU; and bottom-right grey: CPU. 

4.3 CPU — On Diagonal Sub-matrices 

The grey blocks in the figure are of size BlocksPerKernel x ThreadsPerBlock, 
and are calculated on the CPU. This is trivially computed via Algorithm Q] 
combined with a loop over the ElementsPerThread update vectors. 



Fig. 1. A colour coding of the manner in which the panels of L are computed 
on either the CPU (grey panels) or the GPU (rest) — see section [4] for an 
explanation. 



4.4 GPU — Off Diagonal Sub-matrices 

Upload to the GPU Before launching the kernel we transmit the elements of 
the c and s vectors from the previous CPU submatrix, as well as the sub-matrix 
of L corresponding to the current GPU iteration. 

Compute on the GPU The GPU kernel is divided into BlocksPerKernel 

blocks (the threads in each of which may communicate via shared memory, as 
is the nature of the streaming multiprocessor architecture). Each GPU block 
computes a rectangular sub-matrix of L — these are marked by the tall, skinny 
red rectangles in the figure. Abstractly speaking, each block proceeds as follows: 

1. Load the elements of V corresponding to the columns of the current red 
rectangle into local per-thread registers. For each column of L we load 
ElementsPerThread elements of V — each thread handles one column of 
L. 

2. Iterate over the square ThreadsPerBlock x ThreadsPerBlock sized sub- 
matrices delineated by dotted lines: 

(a) Load the elements of c and s corresponding to the rows of the current 
square sub-matrix into per-block shared memory. Since we operate on 
batches of V, there will be ElementsPerThread per row. 

(b) Iterate over the rows of the square sub-matrix: 

i. read an element of L; 

ii. call the Apply function (ElementsPerThread times); 

iii. write that element of L back into global memory. 

3. Write the elements of V from the first step back to global GPU memory. 



Note that this algorithm requires twice as much shared memory as registers. 
Happily that is also the ratio of the amount of those memories available on the 
latest devices. 



Download from the GPU On completion of the kernel, we send the appro- 
priate submatrices of L and V back to host memory. 



5 Results 

To give the reader an idea of roughly what our GPU algorithm might buy them 
in terms of speedups, we present some basic timings in figures [2] and [3j The 
experimental procedure involves forming the matrices B £ M. nxn and V £ W ixk 
with elements drawn i.i.d. from the uniform distribution on [0, 1]. In the update 
test we let A = B T B + 7, where I is the identity matrix, and compute the 
Cholesky factor L using the LAPACK algorithm |3J and update it by V. For the 
downdating test we let A = B T B + I + VV T and downdate by V. Errors are 
calculated as maxy Ajj — Cij where C = L T L is computed with the BLAS 

and L is computed from L by up/down-dating L. The experiment is repeated for 
k = 16 and k = 1 in figures [5] and [31 respectively. For the CPU up/down-dating 
we used the LAPACK suite. 

The test system was a desktop machine running 64 bit Ubuntu linux with 
a 2.8GHz Intel i7 CPU and an Nvidia Tesla C2050 GPU with 14 streaming 
multiprocessors and a total of 448 cores and CUDA version 3.10. Note that 
the number of cores on the GPU is relatively unimportant however, due to the 
bandwidth bound nature of the problem. 

The experiments show that for k = 16, the GPU overtakes the CPU at 
around n — 2000, while for k = 1 we require at least n = 4000 in order to break 
even with the CPU. The errors are always very similar. Note that much larger 
problems can be handled since we only store 0(n) sized panels of L in device 
memory, unfortunately however our test system had rather limited host memory. 
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Fig. 2. Timings and errors for k = 16 on the CPU (red) and GPU (blue). 
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Fig. 3. Timings and errors for k = 
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(h) double precision downdate 
1 on the CPU (red) and GPU (blue). 



